Jonathan Christyadi (502705) - AI Core 02
This notebook predicts the likelihood of a link being a phishing link or a legitimate one, with a focus on exploring the data and testing hypotheses that warrant further research.
import sklearn
import pandas as pd
import seaborn
import numpy as np
print("scikit-learn version:", sklearn.__version__) # 1.4.1.post1
print("pandas version:", pd.__version__) # 2.2.1
print("seaborn version:", seaborn.__version__) # 0.13.2
scikit-learn version: 1.4.1.post1 pandas version: 2.2.1 seaborn version: 0.13.2
After loading the dataset, I found some inconsistencies in the data. First, the link label (phishing or legitimate) can be converted into binary format. Also, in the domain_with_copyright column, some values are binary while others are spelled out as words, for example: zero, One, etc.
df = pd.read_csv("Data/dataset_link_phishing.csv", sep=',', index_col=False, dtype='unicode')
df.head()
| id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | http://www.progarchives.com/album.asp?id=61737 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | ... | 1 | one | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | phishing |
| 1 | 1 | http://signin.eday.co.uk.ws.edayisapi.dllsign.... | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 300 | 65 | 0 | 0 | 1 | 0 | phishing |
| 2 | 2 | http://www.avevaconstruction.com/blesstool/ima... | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | phishing |
| 3 | 3 | http://www.jp519.com/ | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | one | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | legitimate |
| 4 | 4 | https://www.velocidrone.com/ | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | zero | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | legitimate |
5 rows × 87 columns
# Taking a look at the data types of the columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 19431 entries, 0 to 19430 Data columns (total 87 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 19431 non-null object 1 url 19431 non-null object 2 url_length 19431 non-null object 3 hostname_length 19431 non-null object 4 ip 19431 non-null object 5 total_of. 19431 non-null object 6 total_of- 19431 non-null object 7 total_of@ 19431 non-null object 8 total_of? 19431 non-null object 9 total_of& 19431 non-null object 10 total_of= 19431 non-null object 11 total_of_ 19431 non-null object 12 total_of~ 19431 non-null object 13 total_of% 19431 non-null object 14 total_of/ 19431 non-null object 15 total_of* 19431 non-null object 16 total_of: 19431 non-null object 17 total_of, 19431 non-null object 18 total_of; 19431 non-null object 19 total_of$ 19431 non-null object 20 total_of_www 19431 non-null object 21 total_of_com 19431 non-null object 22 total_of_http_in_path 19431 non-null object 23 https_token 19431 non-null object 24 ratio_digits_url 19431 non-null object 25 ratio_digits_host 19431 non-null object 26 punycode 19431 non-null object 27 port 19431 non-null object 28 tld_in_path 19431 non-null object 29 tld_in_subdomain 19431 non-null object 30 abnormal_subdomain 19431 non-null object 31 nb_subdomains 19431 non-null object 32 prefix_suffix 19431 non-null object 33 random_domain 19431 non-null object 34 shortening_service 19431 non-null object 35 path_extension 19431 non-null object 36 nb_redirection 19431 non-null object 37 nb_external_redirection 19431 non-null object 38 length_words_raw 19431 non-null object 39 char_repeat 19431 non-null object 40 shortest_words_raw 19431 non-null object 41 shortest_word_host 19431 non-null object 42 shortest_word_path 19431 non-null object 43 longest_words_raw 19431 non-null object 44 longest_word_host 19431 non-null object 45 longest_word_path 19431 non-null object 46 avg_words_raw 19431 non-null object 47 avg_word_host 19431 non-null object 48 
avg_word_path 19431 non-null object 49 phish_hints 19431 non-null object 50 domain_in_brand 19431 non-null object 51 brand_in_subdomain 19431 non-null object 52 brand_in_path 19431 non-null object 53 suspecious_tld 19431 non-null object 54 statistical_report 19431 non-null object 55 nb_hyperlinks 19431 non-null object 56 ratio_intHyperlinks 19431 non-null object 57 ratio_extHyperlinks 19431 non-null object 58 ratio_nullHyperlinks 19431 non-null object 59 nb_extCSS 19431 non-null object 60 ratio_intRedirection 19431 non-null object 61 ratio_extRedirection 19431 non-null object 62 ratio_intErrors 19431 non-null object 63 ratio_extErrors 19431 non-null object 64 login_form 19431 non-null object 65 external_favicon 19431 non-null object 66 links_in_tags 19431 non-null object 67 submit_email 19431 non-null object 68 ratio_intMedia 19431 non-null object 69 ratio_extMedia 19431 non-null object 70 sfh 19431 non-null object 71 iframe 19431 non-null object 72 popup_window 19431 non-null object 73 safe_anchor 19431 non-null object 74 onmouseover 19431 non-null object 75 right_clic 19431 non-null object 76 empty_title 19431 non-null object 77 domain_in_title 19431 non-null object 78 domain_with_copyright 19431 non-null object 79 whois_registered_domain 19431 non-null object 80 domain_registration_length 19431 non-null object 81 domain_age 19431 non-null object 82 web_traffic 19431 non-null object 83 dns_record 19431 non-null object 84 google_index 19431 non-null object 85 page_rank 19431 non-null object 86 status 19431 non-null object dtypes: object(87) memory usage: 12.9+ MB
# Sampling the dataset
df.sample(10)
| id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 61 | https://en.wikipedia.org/wiki/Switched_at_Birt... | 58 | 16 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | zero | 0 | 902 | 7133 | 12 | 0 | 0 | 7 | legitimate |
| 17573 | 9572 | http://outlook-webapp-portal.el.r.appspot.com/... | 54 | 38 | 1 | 5 | 2 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 228 | 5616 | 0 | 0 | 1 | 5 | phishing |
| 9929 | 1928 | https://www.scilearn.com/ | 25 | 16 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 219 | 8914 | 74155 | 0 | 0 | 5 | legitimate |
| 2936 | 2936 | http://www.whatsapps-invites.zzux.com/ | 38 | 30 | 0 | 3 | 1 | 0 | 0 | 0 | ... | 1 | one | 0 | 116 | 7189 | 481145 | 1 | 1 | 1 | phishing |
| 1291 | 1291 | http://support-appleld.com.secureupdate.duilaw... | 76 | 50 | 1 | 4 | 1 | 0 | 0 | 0 | ... | 1 | zero | 0 | 14 | 4003 | 5816617 | 0 | 1 | 0 | phishing |
| 15484 | 7483 | http://www.davidcourtemarche.com/image/ | 39 | 25 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 137 | 959 | 0 | 0 | 1 | 0 | phishing |
| 14868 | 6867 | http://caspianglobalservices.com/awosoke/fud/f... | 69 | 25 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 217 | 1975 | 0 | 0 | 1 | 0 | phishing |
| 16893 | 8892 | https://jabkzahrimasjoun.blogspot.com/ | 38 | 29 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 373 | 7296 | 0 | 0 | 1 | 5 | phishing |
| 4873 | 4873 | http://calzados32.webcindario.com/app/facebook... | 268 | 26 | 0 | 3 | 0 | 0 | 1 | 1 | ... | 1 | zero | 0 | 952 | 7083 | 17964 | 0 | 1 | 3 | phishing |
| 18156 | 10155 | http://www.cassa7c.com/boa/boa/index.html | 41 | 15 | 1 | 3 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 237 | 128 | 0 | 1 | 1 | 0 | phishing |
10 rows × 87 columns
After examining the sample, I found that some data are not in a good form and there is room for improvement, for example in the domain_with_copyright and status columns.
df['status'].unique()
array(['phishing', 'legitimate'], dtype=object)
As you can see, the status column contains only two values, phishing and legitimate, which means I can transform it into binary values (1 and 0).
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
df.head()
| id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | http://www.progarchives.com/album.asp?id=61737 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | ... | 1 | one | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | 1 |
| 1 | 1 | http://signin.eday.co.uk.ws.edayisapi.dllsign.... | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 300 | 65 | 0 | 0 | 1 | 0 | 1 |
| 2 | 2 | http://www.avevaconstruction.com/blesstool/ima... | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | 1 |
| 3 | 3 | http://www.jp519.com/ | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | one | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | https://www.velocidrone.com/ | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | zero | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | 0 |
5 rows × 87 columns
On closer inspection, I spotted some inconsistencies in the values of the domain_with_copyright column, for example One versus one. As with status, I want to transform it into the binary values 0 and 1 instead of strings.
df['domain_with_copyright'].unique()
array(['one', 'zero', 'One', 'Zero', '1', '0'], dtype=object)
df['domain_with_copyright'] = df['domain_with_copyright'].map({'one': 1, 'zero': 0, 'Zero': 0, 'One': 1,'1': 1, '0': 0}).astype(int)
df['domain_with_copyright'].unique()
array([1, 0])
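The explicit six-key map above works, but it is fragile if new spellings appear. A more robust variant (a sketch, using a hypothetical sample mirroring the raw values seen in the column) lowercases first so casing variants collapse to a single key:

```python
import pandas as pd

# Hypothetical sample mirroring the raw values observed in domain_with_copyright.
s = pd.Series(['one', 'zero', 'One', 'Zero', '1', '0'])

# Lowercase first so 'One'/'one' share one key, then map to integers.
normalized = s.str.lower().map({'one': 1, '1': 1, 'zero': 0, '0': 0}).astype(int)
```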
# Count missing values per column (isnull() is an alias of isna())
total_null = df.isnull().sum()
total_null.sum()
0
# Find columns whose values are all binary (0/1); also count the total number of columns
def count_binary_columns(df):
results = []
counter = 0
for col in df.columns:
counter += 1
if df[col].isin([0, 1]).all():
results.append(col)
return results, counter
count_binary_columns(df)
(['domain_with_copyright', 'status'], 87)
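Note that because the CSV was loaded with dtype='unicode', most columns still hold strings, so `isin([0, 1])` only matches the two columns already converted to integers. A minimal sketch with a hypothetical toy frame shows how numeric conversion reveals string-typed binary columns:

```python
import pandas as pd

# Toy frame with string-typed columns, as produced by dtype='unicode'.
toy = pd.DataFrame({'flag': ['0', '1', '1'], 'count': ['3', '5', '7']})

# The strings '0'/'1' do not equal the integers 0/1, so nothing matches...
assert not toy['flag'].isin([0, 1]).all()

# ...but after numeric conversion the binary column is detected.
toy_num = toy.apply(pd.to_numeric)
binary_cols = [c for c in toy_num.columns if toy_num[c].isin([0, 1]).all()]
```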
df = df.drop(columns=['id', 'url'])
df.head()
| url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | total_of= | total_of_ | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | 1 |
| 1 | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 300 | 65 | 0 | 0 | 1 | 0 | 1 |
| 2 | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | 1 |
| 3 | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | 0 |
| 4 | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | 0 |
5 rows × 85 columns
df['whois_registered_domain'].unique()
array(['0', '1'], dtype=object)
print(df['status'].value_counts())
df['status'].value_counts().plot(kind='bar', title='Count the target variable')
status 0 9716 1 9715 Name: count, dtype: int64
<Axes: title={'center': 'Count the target variable'}, xlabel='status'>
A heatmap will be used to select a suitable set of features for predicting the status target. At this stage I have no prior idea which features to use, so I rely on the heatmap to find the features with the strongest correlation with the target.
First, to determine which features to use in the model, I visualize the correlations between the features.
import seaborn as sns
import matplotlib.pyplot as plt
corr = df.corr(numeric_only=True)
plt.figure(figsize=(100, 100))
plot = sns.heatmap(corr, annot=True, fmt='.2f', linewidths=2)
# Sorting the correlation values with the target variable in descending order
corr.drop('status').sort_values(by='status', ascending=False).plot.bar(y='status', title='Correlation with the target variable', figsize=(20, 10))
<Axes: title={'center': 'Correlation with the target variable'}>
# Finding the features most correlated with the target variable, based on numeric features and excluding NaN values
correlation_matrix = df.corr(numeric_only=True)
sorted_corr = correlation_matrix.sort_values(by='status',ascending=False)
sorted_corr
| url_length | hostname_length | total_of. | total_of- | total_of? | total_of/ | total_of_www | ratio_digits_url | phish_hints | nb_hyperlinks | domain_in_title | domain_with_copyright | google_index | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| status | 0.244348 | 0.240681 | 0.205302 | -0.102849 | 0.293920 | 0.240892 | -0.444561 | 0.356587 | 0.337287 | -0.341295 | 0.339519 | -0.175469 | 0.730684 | -0.509761 | 1.000000 |
| google_index | 0.233061 | 0.216919 | 0.208764 | -0.018285 | 0.202097 | 0.289212 | -0.357215 | 0.323157 | 0.279906 | -0.269482 | 0.265933 | -0.144499 | 1.000000 | -0.386721 | 0.730684 |
| ratio_digits_url | 0.434626 | 0.171761 | 0.224194 | 0.110341 | 0.325739 | 0.206925 | -0.211165 | 1.000000 | 0.096967 | -0.128915 | 0.152393 | -0.027357 | 0.323157 | -0.181489 | 0.356587 |
| domain_in_title | 0.124224 | 0.218850 | 0.108442 | 0.009843 | 0.092191 | 0.088462 | -0.178402 | 0.152393 | 0.125857 | -0.217548 | 1.000000 | 0.076105 | 0.265933 | -0.332742 | 0.339519 |
| phish_hints | 0.332000 | -0.019901 | 0.168765 | 0.065562 | 0.208052 | 0.501321 | -0.090812 | 0.096967 | 1.000000 | -0.112423 | 0.125857 | -0.066130 | 0.279906 | -0.203464 | 0.337287 |
| total_of? | 0.523172 | 0.164129 | 0.353133 | 0.035958 | 1.000000 | 0.243749 | -0.115337 | 0.325739 | 0.208052 | -0.112604 | 0.092191 | -0.046123 | 0.202097 | -0.123151 | 0.293920 |
| url_length | 1.000000 | 0.217586 | 0.447198 | 0.406951 | 0.523172 | 0.486490 | -0.067973 | 0.434626 | 0.332000 | -0.098101 | 0.124224 | -0.004281 | 0.233061 | -0.099900 | 0.244348 |
| total_of/ | 0.486490 | -0.061203 | 0.242216 | 0.204793 | 0.243749 | 1.000000 | -0.005628 | 0.206925 | 0.501321 | -0.073183 | 0.088462 | -0.023213 | 0.289212 | -0.113861 | 0.240892 |
| hostname_length | 0.217586 | 1.000000 | 0.406834 | 0.059480 | 0.164129 | -0.061203 | -0.130991 | 0.171761 | -0.019901 | -0.104614 | 0.218850 | 0.073107 | 0.216919 | -0.160621 | 0.240681 |
| total_of. | 0.447198 | 0.406834 | 1.000000 | 0.049303 | 0.353133 | 0.242216 | 0.068290 | 0.224194 | 0.168765 | -0.093994 | 0.108442 | 0.057320 | 0.208764 | -0.098752 | 0.205302 |
| total_of- | 0.406951 | 0.059480 | 0.049303 | 1.000000 | 0.035958 | 0.204793 | 0.045756 | 0.110341 | 0.065562 | -0.004513 | 0.009843 | 0.020914 | -0.018285 | 0.104676 | -0.102849 |
| domain_with_copyright | -0.004281 | 0.073107 | 0.057320 | 0.020914 | -0.046123 | -0.023213 | 0.087826 | -0.027357 | -0.066130 | 0.192159 | 0.076105 | 1.000000 | -0.144499 | 0.057127 | -0.175469 |
| nb_hyperlinks | -0.098101 | -0.104614 | -0.093994 | -0.004513 | -0.112604 | -0.073183 | 0.114259 | -0.128915 | -0.112423 | 1.000000 | -0.217548 | 0.192159 | -0.269482 | 0.221066 | -0.341295 |
| total_of_www | -0.067973 | -0.130991 | 0.068290 | 0.045756 | -0.115337 | -0.005628 | 1.000000 | -0.211165 | -0.090812 | 0.114259 | -0.178402 | 0.087826 | -0.357215 | 0.110745 | -0.444561 |
| page_rank | -0.099900 | -0.160621 | -0.098752 | 0.104676 | -0.123151 | -0.113861 | 0.110745 | -0.181489 | -0.203464 | 0.221066 | -0.332742 | 0.057127 | -0.386721 | 1.000000 | -0.509761 |
# Get all the correlated features with the target variable
num_features = len(sorted_corr['status']) # 15 features
sorted_corr['status'].head(num_features)
status 1.000000 google_index 0.730684 ratio_digits_url 0.356587 domain_in_title 0.339519 phish_hints 0.337287 total_of? 0.293920 url_length 0.244348 total_of/ 0.240892 hostname_length 0.240681 total_of. 0.205302 total_of- -0.102849 domain_with_copyright -0.175469 nb_hyperlinks -0.341295 total_of_www -0.444561 page_rank -0.509761 Name: status, dtype: float64
# List the features from the previous step into a list
selected_features = ['google_index', 'ratio_digits_url', 'domain_in_title', 'phish_hints', 'total_of?', 'url_length', 'total_of/','hostname_length','total_of.', 'total_of-','domain_with_copyright','nb_hyperlinks','total_of_www','page_rank']
# selected_features = sorted_corr['status'].head(num_features).index.tolist()
df[selected_features] = df[selected_features].apply(pd.to_numeric, errors='coerce')
# Check the data types of the selected columns after conversion
print(df[selected_features].dtypes)
# Check if 'status' column exists and has categorical or numerical data
print(df['status'].dtype)
# Create a DataFrame with the selected columns
selected_df = df[selected_features + ['status']]
selected_df.head()
google_index int64 ratio_digits_url float64 domain_in_title int64 phish_hints int64 total_of? int64 url_length int64 total_of/ int64 hostname_length int64 total_of. int64 total_of- int64 domain_with_copyright int32 nb_hyperlinks int64 total_of_www int64 page_rank int64 dtype: object int64
| google_index | ratio_digits_url | domain_in_title | phish_hints | total_of? | url_length | total_of/ | hostname_length | total_of. | total_of- | domain_with_copyright | nb_hyperlinks | total_of_www | page_rank | status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.108696 | 1 | 0 | 1 | 46 | 3 | 20 | 3 | 0 | 1 | 143 | 1 | 5 | 1 |
| 1 | 1 | 0.054688 | 1 | 2 | 0 | 128 | 3 | 120 | 10 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0.000000 | 1 | 0 | 0 | 52 | 4 | 25 | 3 | 0 | 0 | 3 | 1 | 0 | 1 |
| 3 | 0 | 0.142857 | 1 | 0 | 0 | 21 | 3 | 13 | 2 | 0 | 1 | 404 | 1 | 0 | 0 |
| 4 | 0 | 0.000000 | 0 | 0 | 0 | 28 | 3 | 19 | 2 | 0 | 0 | 57 | 1 | 4 | 0 |
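The hand-typed selected_features list above could also be derived programmatically from the sorted correlations. A minimal sketch with hypothetical correlation values:

```python
import pandas as pd

# Toy correlation-with-target series; the values here are hypothetical.
corr_with_status = pd.Series(
    {'status': 1.0, 'google_index': 0.73, 'page_rank': -0.51, 'total_of-': -0.10})

# Drop the target itself and order features by absolute correlation strength.
features = (corr_with_status.drop('status')
            .abs().sort_values(ascending=False).index.tolist())
```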
# Count the number of binary columns in the selected features
features_binary = count_binary_columns(df[selected_features])
features_binary
(['google_index', 'domain_in_title', 'domain_with_copyright'], 14)
from sklearn.preprocessing import StandardScaler
# Scale the data
selected_df = selected_df.dropna()
scaler = StandardScaler()
selected_df[selected_features] = scaler.fit_transform(selected_df[selected_features])
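One caveat: fitting the scaler on the full frame before any train/test split lets test-set statistics leak into training. A common pattern, sketched here on toy arrays rather than the notebook's frame, fits the scaler on the training portion only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix; in practice this would be the selected feature columns.
X = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)   # statistics computed from train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # reuse the train statistics on test
```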
from pandas.plotting import scatter_matrix
scatter_matrix(selected_df, alpha=1, figsize=(60, 60), diagonal='hist')
plt.show()
# Create pairplot
sns.pairplot(selected_df, hue='status', palette='Set1')
# Add a legend (hue order follows the sorted status values: 0 = legitimate, 1 = phishing)
plt.legend(title='Status', labels=['Legitimate', 'Phishing'])
# Show the plot
plt.show()
target = 'status'
X = df[selected_features]  # note: df is unscaled; the scaled copy (selected_df) was only used for plotting
y = df[target]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)  # fixed seed for reproducibility
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 19431 observations, of which 15544 are now in the train set, and 3887 in the test set.
# SUPPORT VECTOR MACHINE (SVM)
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.8505273990223823
from sklearn.metrics import classification_report
predictions = model.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
precision recall f1-score support
0 0.87 0.82 0.85 1939
1 0.83 0.88 0.85 1948
accuracy 0.85 3887
macro avg 0.85 0.85 0.85 3887
weighted avg 0.85 0.85 0.85 3887
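A confusion matrix is a useful complement to the classification report above. A minimal sketch with hypothetical labels standing in for y_test and predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels (hypothetical, not the notebook's test set).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
```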
# LINEAR REGRESSION (baseline; the target is binary, so R² is only a rough diagnostic)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.7019279787877617
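Since the target is binary, logistic regression is the natural classification counterpart to the linear fit above. A sketch on synthetic data (not the phishing features), assuming default solver settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the selected features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # mean accuracy, directly comparable to the classifiers above
```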
import shap
# Shap explainer initialized with the model and training data
explainer = shap.Explainer(model, X_train)
# Calculate Shap values for the predictions made on the test set
shap_values = explainer.shap_values(X_test)
# Plot the Shap values using bee swarm plot
shap.summary_plot(shap_values, X_test)
# K-NEAREST NEIGHBORS
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9194751736557757
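The choice of n_neighbors=4 above is arbitrary; cross-validation is a common way to pick k. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the train set.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Mean 5-fold CV accuracy for a few candidate k values.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in (1, 3, 5, 7, 9)}
best_k = max(scores, key=scores.get)
```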
# DECISION TREE
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(min_samples_leaf=40, min_samples_split=300)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9279650115770517
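Beyond accuracy, a fitted tree exposes feature importances, which can indicate which of the selected features drive the splits. A sketch on synthetic data with the same leaf-size constraint as above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the selected features.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(min_samples_leaf=40, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1; higher values mark more informative splits.
importances = tree.feature_importances_
```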
target_names = ["legitimate", "phishing"]  # class_names must follow the sorted labels: 0 = legitimate, 1 = phishing
import matplotlib.pyplot as plt
plt.figure(figsize=(40,40))
from sklearn.tree import plot_tree
plot_tree(model, fontsize=8, feature_names=selected_features, class_names=target_names)
plt.show()